angular update
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
Spherical Motion Dynamics: Learning Dynamics of Normalized Neural Network using SGD and Weight Decay
In this paper, we comprehensively reveal the learning dynamics of normalized neural network using Stochastic Gradient Descent (with momentum) and Weight Decay (WD), named as Spherical Motion Dynamics (SMD). Most related works focus on studying behavior of equilibrium state, i.e. assuming weight norm remains unchanged. However, their discussion on why this equilibrium can be reached is either absent or less convincing. Our work directly explores the cause of equilibrium, as a special state of SMD. Specifically, 1) we introduce the assumptions that can lead to equilibrium state in SMD, and prove equilibrium can be reached in a linear rate regime under given assumptions; 2) we propose ``angular update as a substitute for effective learning rate to depict the state of SMD, and derive the theoretical value of angular update in equilibrium state; 3) we verify our assumptions and theoretical results on various large-scale computer vision tasks including ImageNet and MSCOCO with standard settings. Experiment results show our theoretical findings agree well with empirical observations. We also show that the behavior of angular update in SMD can produce interesting effect to the optimization of neural network in practice.
- North America > United States > Texas > Coleman County (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
Analyzing & Reducing the Need for Learning Rate Warmup in GPT Training
Kosson, Atli, Messmer, Bettina, Jaggi, Martin
Learning Rate Warmup is a popular heuristic for training neural networks, especially at larger batch sizes, despite limited understanding of its benefits. Warmup decreases the update size $\Delta \mathbf{w}_t = \eta_t \mathbf{u}_t$ early in training by using lower values for the learning rate $\eta_t$. In this work we argue that warmup benefits training by keeping the overall size of $\Delta \mathbf{w}_t$ limited, counteracting large initial values of $\mathbf{u}_t$. Focusing on small-scale GPT training with AdamW/Lion, we explore the following question: Why and by which criteria are early updates $\mathbf{u}_t$ too large? We analyze different metrics for the update size including the $\ell_2$-norm, resulting directional change, and impact on the representations of the network, providing a new perspective on warmup. In particular, we find that warmup helps counteract large angular updates as well as a limited critical batch size early in training. Finally, we show that the need for warmup can be significantly reduced or eliminated by modifying the optimizer to explicitly normalize $\mathbf{u}_t$ based on the aforementioned metrics.
- North America > United States (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Europe > Switzerland (0.04)
- (2 more...)
Spherical Motion Dynamics: Learning Dynamics of Normalized Neural Network using SGD and Weight Decay
In this paper, we comprehensively reveal the learning dynamics of normalized neural network using Stochastic Gradient Descent (with momentum) and Weight Decay (WD), named as Spherical Motion Dynamics (SMD). Most related works focus on studying behavior of effective learning rate" inequilibrium" state, i.e. assuming weight norm remains unchanged. However, their discussion on why this equilibrium can be reached is either absent or less convincing. Our work directly explores the cause of equilibrium, as a special state of SMD. Specifically, 1) we introduce the assumptions that can lead to equilibrium state in SMD, and prove equilibrium can be reached in a linear rate regime under given assumptions; 2) we propose angular update" as a substitute for effective learning rate to depict the state of SMD, and derive the theoretical value of angular update in equilibrium state; 3) we verify our assumptions and theoretical results on various large-scale computer vision tasks including ImageNet and MSCOCO with standard settings. Experiment results show our theoretical findings agree well with empirical observations.
Spherical Motion Dynamics: Learning Dynamics of Neural Network with Normalization, Weight Decay, and SGD
Wan, Ruosi, Zhu, Zhanxing, Zhang, Xiangyu, Sun, Jian
In this work, we comprehensively reveal the learning dynamics of neural network with normalization, weight decay (WD), and SGD (with momentum), named as Spherical Motion Dynamics (SMD). Most related works study SMD by focusing on "effective learning rate" in "equilibrium" condition, where weight norm remains unchanged. However, their discussions on why equilibrium condition can be reached in SMD is either absent or less convincing. Our work investigates SMD by directly exploring the cause of equilibrium condition. Specifically, 1) we introduce the assumptions that can lead to equilibrium condition in SMD, and prove that weight norm can converge at linear rate with given assumptions; 2) we propose "angular update" as a substitute for effective learning rate to measure the evolving of neural network in SMD, and prove angular update can also converge to its theoretical value at linear rate; 3) we verify our assumptions and theoretical results on various computer vision tasks including ImageNet and MSCOCO with standard settings. Experiment results show our theoretical findings agree well with empirical observations.
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- Europe > Russia (0.04)
- Asia > Russia (0.04)
- Asia > China > Beijing > Beijing (0.04)